Understanding Content Reuse on the Web: Static and Dynamic Analyses

Authors

  • Ricardo A. Baeza-Yates
  • Álvaro R. Pereira
  • Nivio Ziviani
Abstract

In this paper we present static and dynamic studies of duplicate and near-duplicate documents on the Web. The static study analyzes similar content among pages within a given snapshot of the Web; the dynamic study analyzes how pages in an old snapshot are reused to compose new documents in a more recent snapshot. We ran a series of experiments using four snapshots of the Chilean Web. In the static study, we identify duplicates in both parts of the Web graph, the reachable component (connected by links) and the unreachable component (unconnected), aiming to determine where duplicates occur more frequently. We show that the number of duplicates on the Web seems to be much higher than previously reported (about 50% higher), and that in our data the number of duplicates in the unreachable component is 74.6% higher than the number of duplicates in the reachable component of the Web graph. In the dynamic study, we show that some of the old content is used to compose new pages. If a page in a newer snapshot has content from a page in an older snapshot, we say that the source is a parent of the new page. We hypothesize that people use search engines to find pages and republish their content as a new document. We present evidence that this happens for part of the pages that have parents. In this case, part of the Web content is biased by the ranking function of search engines.
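The two notions used in the abstract, exact duplicates within a snapshot and a "parent" page whose content reappears in a newer page, can be illustrated with a minimal sketch. The abstract does not describe the detection method, so everything below is an assumption made for illustration: the function names, the use of word shingles with a containment score, the shingle length `k`, and the `threshold` value are hypothetical and are not taken from the paper.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text.

    Two pages with the same fingerprint are treated as exact duplicates
    (an illustrative assumption, not the paper's definition).
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 4) -> set:
    """Set of k-word shingles (overlapping word windows) of the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(old_text: str, new_text: str, k: int = 4) -> float:
    """Fraction of the old page's shingles that reappear in the new page."""
    old = shingles(old_text, k)
    if not old:
        return 0.0
    return len(old & shingles(new_text, k)) / len(old)


def find_parents(old_snapshot: dict, new_page: str,
                 threshold: float = 0.5, k: int = 4) -> list:
    """URLs of old-snapshot pages whose content is reused in new_page.

    old_snapshot maps URL -> page text. The threshold is hypothetical.
    """
    return [url for url, text in old_snapshot.items()
            if containment(text, new_page, k) >= threshold]
```

In this sketch, a page from the older snapshot counts as a parent when at least half of its shingles occur in the newer page. A real study would need to choose the unit of comparison and the threshold empirically.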

Similar resources

Optimizing the Execution and Response of Web Pages in the Cloud with Preprocessing Methods: A Case Study of the Varnish and Nginx Systems

The response speed of Web pages is one of the necessities of information technology. In recent years, renowned companies such as Google, as well as computer scientists, have focused on speeding up the Web. Achievements such as Google PageSpeed, Nginx and Varnish are the result of this research. In Customer to Customer (C2C) business systems, such as chat systems, and in Business to Customer (B2C) systems, s...

Improve Replica Placement in Content Distribution Networks with Hybrid Technique

The increasing use of the Internet and its accelerated growth lead to reduced network bandwidth and server capacity; as a result, the quality of Internet services becomes unacceptable to users, while the efficient and effective delivery of content on the Web plays an important role in improving performance. Content distribution networks were introduced to address this issue. Replicatin...

Investigating The Seismic Response of Structural Walls Using Nonlinear Static and Incremental Dynamic Analyses

Structural walls are commonly used as efficient structural elements to resist lateral and vertical loads. The diverse performance of bearing wall systems in past earthquakes motivates investigation of the adequacy of current seismic design provisions for these walls. This study considers the seismic performance of model walls of bearing wall and building frame systems designed as ordinary and special struct...

Accelerating Dynamic Web Content Delivery Using Keyword-Based Fragment Detection

Recent advances in Web engineering have enabled the rapid growth of dynamic Web services such as Web-based email, online banking, online shopping and entertainment. We envision that finding an effective way to deliver these dynamic Web services and understanding the relationship between Web application design and delivery are two important Web engineering issues, and have not been seriously con...

Workload Characterization of a Personalized Web Site — And Its Implications for Dynamic Content Caching

Requests for dynamic and personalized content increasingly dominate current-day Internet traffic; however, traditional caching architectures are not well-suited to cache such content. Several recently proposed techniques, which exploit reuse at the sub-document level, promise to address this shortcoming, but require a better understanding of the workloads seen on web sites that serve such conte...

Publication date: 2006